Environment Setup¶

In [42]:
# General Imports
import numpy as np
import pandas as pd
import geopandas as gpd
import os
import glob
import folium
import json
import branca.colormap as cmp
import plotly.graph_objects as go
from datetime import datetime
from copy import deepcopy
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
In [2]:
import plotly.offline as pyo
pyo.init_notebook_mode()

Data Loading¶

In [3]:
history_data = pd.read_csv('data/EV_Data/Electric_Vehicle_Population_Size_History_By_County.csv')
history_data.head()
Out[3]:
Date County State Vehicle Primary Use Battery Electric Vehicles (BEVs) Plug-In Hybrid Electric Vehicles (PHEVs) Electric Vehicle (EV) Total Non-Electric Vehicle Total Total Vehicles Percent Electric Vehicles
0 October 31 2019 Flathead MT Passenger 1 0 1 70 71 1.41
1 October 31 2019 New London CT Passenger 0 2 2 244 246 0.81
2 June 30 2019 Pend Oreille WA Truck 0 0 0 5730 5730 0.00
3 November 30 2019 Virginia Beach VA Passenger 1 0 1 706 707 0.14
4 February 28 2018 Sacramento CA Passenger 1 0 1 307 308 0.32
In [4]:
vehicle_data = pd.read_csv('data/EV_Data/Electric_Vehicle_Population_Data.csv')
vehicle_data.head()
Out[4]:
VIN (1-10) County City State Postal Code Model Year Make Model Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility Electric Range Base MSRP Legislative District DOL Vehicle ID Vehicle Location Electric Utility 2020 Census Tract
0 5YJYGDEF5L Thurston Lacey WA 98516.0 2020 TESLA MODEL Y Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 291 0 22.0 124535071 POINT (-122.7474291 47.0821119) PUGET SOUND ENERGY INC 5.306701e+10
1 1N4BZ1CP1K King Sammamish WA 98074.0 2019 NISSAN LEAF Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 150 0 45.0 102359449 POINT (-122.0313266 47.6285782) PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) 5.303303e+10
2 5YJXCDE28G King Kent WA 98031.0 2016 TESLA MODEL X Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 200 0 33.0 228682037 POINT (-122.2012521 47.3931814) PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) 5.303303e+10
3 JHMZC5F37M Kitsap Poulsbo WA 98370.0 2021 HONDA CLARITY Plug-in Hybrid Electric Vehicle (PHEV) Clean Alternative Fuel Vehicle Eligible 47 0 23.0 171566447 POINT (-122.64177 47.737525) PUGET SOUND ENERGY INC 5.303509e+10
4 WA1F2AFY4P Thurston Olympia WA 98501.0 2023 AUDI Q5 E Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 23 0 22.0 234923230 POINT (-122.89692 47.043535) PUGET SOUND ENERGY INC 5.306701e+10
In [5]:
us_outlines = {}

# Read in all geo
for geo_json in glob.glob('data/Geo_Data/*.json'):
    with open(geo_json, 'r', encoding='utf-8', errors='ignore') as geo_data:
        us_outlines[os.path.basename(geo_json)[:-5]] = (json.load(geo_data))

Preprocessing and Data Extraction¶

To start, we must drop any rows with missing (NaN) data.

In [6]:
# Drop nan's
history_data.dropna(inplace=True)
print(f'Num of history_data NaNs: %d' % sum([history_data[col].isna().sum() for col in history_data.columns]))

vehicle_data.dropna(inplace=True)
print(f'Num of vehicle_data NaNs: %d' % sum([vehicle_data[col].isna().sum() for col in vehicle_data.columns]))
Num of history_data NaNs: 0
Num of vehicle_data NaNs: 0

The GeoJson datasets contain geo data on all relevant locations in the U.S. We must extract the counties specific to Washington for our analysis. The GeoJson for the counties uses the state number to represent the states, but our main datasets don't use this. We will need to extract this value from the states GeoJson, then use this id to extract only the counties in Washington.

In [7]:
# Get Washington's state id
for state in us_outlines['states']['features']:
    if state['properties']['NAME'] == 'Washington':
        wa_id = state['properties']['STATE']
        break
        
print(f'Washington State ID: {wa_id}')
Washington State ID: 53
In [8]:
# Get geo data of all counties in washington
wa_counties = []
for county in us_outlines['counties']['features']:
    if county['properties']['STATE'] == '53':
        wa_counties.append(county)
wa_county_outlines = deepcopy(us_outlines['counties'])
wa_county_outlines['features'] = wa_counties
print(f'Number of counties: %d' % len(wa_counties))
Number of counties: 39

Each row of vehicle_data is an individual vehicle with information on where it resides. A new dataframe must be made to hold the numerical counts of EVs in general per county to use in our choropleths

In [9]:
# Get count of ev's in each county
total_ev_per_county = pd.DataFrame(vehicle_data['County'].value_counts())
total_ev_per_county.rename(columns={'County': 'Count'}, inplace=True)
total_ev_per_county.head()
Out[9]:
Count
King 84940
Snohomish 19012
Pierce 12575
Clark 9595
Thurston 5880

We will transform the date in history_data into DateTime to extract date/time parts more easily.

In [10]:
import calendar

# Get month name to number mapper
month_mapper = {month_name: month_num for month_num, month_name in enumerate(calendar.month_name)}
month_mapper.pop('')
month_mapper

# Transform date string (if it is a string) to datetime object
if pd.api.types.is_string_dtype(history_data['Date'].dtype):
    history_data['Date'] = history_data['Date'].str.split().apply(lambda x: datetime.strptime(
                                                                    '/'.join([str(month_mapper[x[0]])] + x[1:]),
                                                                    '%m/%d/%Y'
                                                                ))

Analysis¶

The data for each county can be assumed to be approximately the population data as the dataset source is described to show all the EVs "that are currently registered through Washington State Department of Licensing (DOL)".

In [28]:
# Get year and count number of vehicles registered **per** year
register_per_year = dict(history_data['Date'].dt.year.value_counts())
register_per_year = {year: register_per_year[year] for year in sorted(register_per_year)}

# Compute cumulative sum of vehicle registered in each year
cum_year_count = {}
cumsum = 0
for year in register_per_year:
    cum_year_count[year] = cumsum + register_per_year[year]
    cumsum = cum_year_count[year]

# Plot Cumulative sum of vehicle registered in each year
fig = go.Figure()
fig.add_traces(go.Scatter(x=list(cum_year_count.keys()), 
                                               y=list(cum_year_count.values())
                                              )
                                   )
fig.show()

# Compute change in count of vehicle registration per year
annual_registration_change = {}
prev_year = ('0000', 0)
for year in register_per_year:
    annual_registration_change[f'{prev_year[0]} to {year}'] = register_per_year[year] - prev_year[1]
    prev_year = (year, register_per_year[year])

# Plot change in a bar plot, omitting first change (no change before 2017)
fig = go.Figure()
fig.add_trace(go.Bar(
        x=list(annual_registration_change.keys())[1:],
        y=list(annual_registration_change.values())[1:],
        text=list(annual_registration_change.values())[1:],
        textposition='auto',
))

# Change the bar mode to have negative change under 0
fig.update_layout(barmode='relative', title_text='Change in # of Registration of EVs Between Years')
fig.show()

Total EVs owned per year is more or less consistent from 2017 to 2023. At first glance, the scatter plot provides the false implication that a simple linear regression is a safe option to predict the likely number of EVs in the coming years. However, if look at the change in the rate at which we are registering new EVs per year, we actually see there is a negative trend in the rate.
In the change from 2022 to 2023, we see a drastic difference from previous changes. Upon further observation, it seems this is due to missing data for December 2023. In order to fix this, we'll need to zoom in on the months to obtain a likely register count for December 2023.

In [13]:
# Collect count per month per year
year_month_count = {year: {month: 0 for month in range(1, 13)} for year in register_per_year}
for index, row in history_data.iterrows():
    year_month_count[row['Date'].year][row['Date'].month] += 1

# Plot months in all years on same plot
fig = go.Figure()

for year in year_month_count:
    fig.add_traces(go.Scatter(x=list(year_month_count[year].keys()),
                                             y=list(year_month_count[year].values()),
                                             name=year
                                            )
                                 )

fig.show()

Let's also look at the mean, variance, and standard deviation. Recall this dataset is representative of the population of washington, so we can consider them to be the population mean, variance, and standard deviation ($\mu$, $\sigma^2$, $\sigma$).

In [14]:
def compute_vis_month_stat(year_month_count):
    '''
    Computes the mean, variance, and standard deviation and plots a box plot and distribution plot for them.
    '''
    # Collect counts of each year index by months (- 1) instead
    month_count_list = {month: [] for month in range(1, 13)}
    for year in year_month_count:
        [month_count_list[month].append(year_month_count[year][month]) for month in month_count_list]
        
    # Compute mean and variance per year
    months_mv = {month: (np.mean(month_count_list[month]),
                         np.var(month_count_list[month]),
                         np.std(month_count_list[month])) for month in month_count_list}
        
    # Plot box plot of the distribution of Mean count of each month
    fig = go.Figure()
    for month in month_count_list:
        fig.add_trace(go.Box(y=month_count_list[month], name=month, marker_color='red'))
    fig.update_layout(showlegend=False,
                                  title_text='Distribution of the Mean Count of Each Month (Boxplot)')
    fig.show()

    # Plot normal distplot without histogram of the distribution of Mean count of each month
    import plotly.figure_factory as ff
    ffig = ff.create_distplot([[months_mv[month][0] for month in months_mv]], group_labels=['Mean'],
                             bin_size=.2, show_hist=False)
    ffig.update_layout(showlegend=False,
                                   title_text='Distribution of the Mean Count of Each Month (Distplot)')
    ffig.show()
    
    print("Month Stats\n", pd.DataFrame(months_mv).rename(index={0: 'Mean', 1: 'Var', 2: 'Std'}))
In [15]:
# Compute and visualize fixed year month count stats
compute_vis_month_stat(year_month_count)
Month Stats
               1            2           3           4           5           6   \
Mean  234.428571   234.142857  233.714286  235.142857  235.000000  235.428571   
Var   954.530612  1018.122449  951.918367  844.693878  743.428571  699.102041   
Std    30.895479    31.908031   30.853174   29.063618   27.265887   26.440538   

              7           8           9           10          11           12  
Mean  235.857143  237.571429  236.714286  238.571429  240.285714   205.571429  
Var   807.265306  755.959184  714.775510  718.816327  712.489796  7724.816327  
Std    28.412415   27.494712   26.735286   26.810750   26.692504    87.890934  

There is a clear indication that 2023 is an abnormality in the trend in general from the plot; however, December is completely missing data, which is also drastically affecting our $\mu$ and $\sigma^2$ and by extention $\sigma$. We will need to fill in this value. To do this, the best we can do is analyze the trend and obtain the average change per month throughout the years between November and December, excluding 2023. Using the mean change, we can approximate the most likely scenario for December given the registration count we had for November.

In [16]:
# Calculate the mean change from Nov to Dec
from math import ceil, floor
n_d_changes = [year_month_count[year][12] - year_month_count[year][11] for year in year_month_count if year != 2023]
mean_n_d_change = np.mean(n_d_changes)
mean_n_d_change = ceil(mean_n_d_change)if mean_n_d_change >= 0 else floor(mean_n_d_change)

# Approximate December 2023
year_month_count_fix = deepcopy(year_month_count)
year_month_count_fix[2023][12] = year_month_count_fix[2023][11] + mean_n_d_change
dec_2023 = year_month_count_fix[2023][12]
    
print(f'Mean Change From Nov to Dec (excluding 2023): {mean_n_d_change}')
print(f'Filled Approximate of Dec Registrations: {dec_2023}')
Mean Change From Nov to Dec (excluding 2023): -1
Filled Approximate of Dec Registrations: 239
In [17]:
# Compute and visualize fixed year month count stats
compute_vis_month_stat(year_month_count)
Month Stats
               1            2           3           4           5           6   \
Mean  234.428571   234.142857  233.714286  235.142857  235.000000  235.428571   
Var   954.530612  1018.122449  951.918367  844.693878  743.428571  699.102041   
Std    30.895479    31.908031   30.853174   29.063618   27.265887   26.440538   

              7           8           9           10          11           12  
Mean  235.857143  237.571429  236.714286  238.571429  240.285714   205.571429  
Var   807.265306  755.959184  714.775510  718.816327  712.489796  7724.816327  
Std    28.412415   27.494712   26.735286   26.810750   26.692504    87.890934  

Now we have the statistics without missing data, where December is approximated using previous year's trends, let's go ahead and and re-plot the change bar plot and count scatter plots

In [25]:
annual_registration_change_fix
Out[25]:
{'0000 to 2017': 2247,
 '2017 to 2018': 235,
 '2018 to 2019': 298,
 '2019 to 2020': 193,
 '2020 to 2021': 194,
 '2021 to 2022': 43,
 '2022 to 2023': -213}
In [19]:
# Deep copy count dicts
cum_year_count_fix = deepcopy(cum_year_count)
annual_registration_change_fix = deepcopy(annual_registration_change)

# Fix count dicts to include December
cum_year_count_fix[2023] += dec_2023
annual_registration_change_fix['2022 to 2023'] += dec_2023
In [27]:
# Plot Cumulative sum of vehicle registered in each year
fig = go.Figure()
fig.add_traces(go.Scatter(x=list(cum_year_count_fix.keys()), 
                                               y=list(cum_year_count_fix.values())
                                              )
                                   )
fig.show()

# Plot change in a bar plot, omitting first change (no change before 2017)
fig = go.Figure()
fig.add_trace(go.Bar(
        x=list(annual_registration_change_fix.keys())[1:],
        y=list(annual_registration_change_fix.values())[1:],
        text=list(annual_registration_change_fix.values())[1:],
        textposition='auto',
))

# Change the bar mode to have negative change under 0
fig.update_layout(barmode='relative', title_text='Change in # of Registration of EVs Between Years')
fig.show()

# Plot months in all years on same plot
fig = go.Figure()

for year in year_month_count_fix:
    fig.add_traces(go.Scatter(x=list(year_month_count_fix[year].keys()),
                                             y=list(year_month_count_fix[year].values()),
                                             name=year
                                            )
                                 )

fig.show()

Now that December 2023 is fixed, we can see that there is still a drastic difference in the change of EV registration counts between 2022 and 2023. With all the information now correct, we can make a decision on the predictor we'd like to use. In this case, since how much new EVs is being registered is slowing down, it can be safe to assume we can use parabolic regression to predict the amount of new EVs we'll get per year. We can see this also if we plot it out. For our regression, we will now assume "years since 2017", i.e. 2018 = 1 for 1 year since 2017.

In [41]:
annual_reg_fix = {year-2017: sum(year_month_count_fix[year][m] for m in year_month_count_fix[year]) 
                  for year in year_month_count_fix}

# Plot registration count per year since 2017
fig = go.Figure()

for year in annual_reg_fix:
    fig.add_traces(go.Scatter(x=list(annual_reg_fix.keys()),
                                             y=list(annual_reg_fix.values()),
                                            )
                                 )

fig.show()

Maps¶

In [21]:
def compute_county_center(coords):
    '''
    Compute the center of each county given the list of coords by getting the max and min of the x or y coords
    for the county and dividing by 2 to get the center of the min and max of the x or y.
    '''
    def min_max_county_coords(sum_coords, test = False):
        min_x = np.inf
        max_x = -np.inf
        min_y = np.inf
        max_y = -np.inf
        
        for coord in sum_coords:
            if isinstance(coord[0], list):
                mm_coords = min_max_county_coords(coord, True) # recursively get min_max of nested coord lists(due to islands)
                    
                min_x = min(min_x, mm_coords[0][0])
                max_x = max(max_x, mm_coords[0][1])
                min_y = min(min_y, mm_coords[1][0])
                max_y = max(max_y, mm_coords[1][1])
            else:
                min_x = min(min_x, coord[0])
                max_x = max(max_x, coord[0])
                min_y = min(min_y, coord[1])
                max_y = max(max_y, coord[1])
            
        return [[min_x, max_x], [min_y, max_y]]
    assert isinstance(coords, list)
    
    mm_coords = min_max_county_coords(coords)
        
    return [sum(mm_coords[0]) / 2, sum(mm_coords[1]) / 2]
In [22]:
# Linear color map function for chropleth to plot continuous heatmap coloring.
linear_color = cmp.LinearColormap(
    ['#F2FFDB', '#8B0000'],
    vmin=min(total_ev_per_county['Count']), vmax=max(total_ev_per_county['Count']),
    caption='Total # of EVs in 2023'
)

# Create choropleth plot of EV Ownership in Seattle per county
ev_own_choro = folium.Map([47.751076, -120.740135], zoom_start=6.5)
wa_choro = folium.features.GeoJson(
    wa_county_outlines,
    style_function=lambda county: {
        'fillColor': linear_color(total_ev_per_county.loc[county['properties']['NAME']]['Count']),
        'fillOpacity': 0.9,
    },
    highlight_function=lambda county: {
        'fillColor': '#000000', 
        'fillOpacity': 0.20
    },
    tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County'])
)

# Add choropleth properties
ev_own_choro.add_child(wa_choro)
ev_own_choro.add_child(linear_color)

# Add count markers on each county
for county in wa_county_outlines['features']:
    center_coord = compute_county_center(county['geometry']['coordinates'])
    county_name = county['properties']['NAME']
    county_ev_count = total_ev_per_county.loc[county_name]['Count']
    
    folium.Marker(location=[center_coord[1], center_coord[0]], icon=folium.DivIcon(
                      f"<div style='font-size: 15pt'>{county_ev_count}</div>")
                 ).add_to(ev_own_choro)

ev_own_choro
Out[22]:
Make this Notebook Trusted to load map: File -> Trust Notebook

TODO: MAY REMOVE¶

The datasets use abbreviations for the, so we'll need to convert the abbreviation to the actual names for folium/geopandas. Once they've been converted, all rows with any NaN will be dropped as they will affect the visualization and predictive models.

In [23]:
state_abbrev_raw = pd.read_csv('data/states_abbrev.csv')
state_abbrev_mapper = dict(zip(state_abbrev_raw['Abbreviation'], state_abbrev_raw['State']))
history_data['State'] = history_data['State'].replace(state_abbrev_mapper)
vehicle_data['State'] = vehicle_data['State'].replace(state_abbrev_mapper)

The GeoJson dataset is a simple dictionary. We will need to convert to a geopandas dataframe. Since we're only interested in the geo data, we can drop all columns except the County name and the geometry data.

In [24]:
wa_counties_gdf = gpd.GeoDataFrame.from_features(wa_county_outlines).drop(columns=['GEO_ID',
                                                                                         'LSAD',
                                                                                         'CENSUSAREA',
                                                                                         'COUNTY',
                                                                                         'STATE'])
wa_counties_gdf.rename(columns={'NAME': 'County'}, inplace=True)
wa_counties_gdf.head()
Out[24]:
geometry County
0 POLYGON ((-118.04115 46.77602, -118.04117 46.7... Adams
1 POLYGON ((-116.91750 45.99547, -116.94068 45.9... Asotin
2 POLYGON ((-119.29769 45.93574, -119.29879 45.9... Benton
3 POLYGON ((-121.15190 47.86676, -121.15251 47.8... Chelan
4 POLYGON ((-123.78127 48.15561, -123.77947 48.1... Clallam